ÁRBOL guided tutorial
Authors: Kyle Kimler and Ben Doran
Compiled: December 16, 2022
ARBOL iteratively explores axes of variation in scRNA-seq data by clustering and subclustering until variation between cells in subclusters becomes noise. The philosophy of ARBOL is that every axis of variation could be biologically meaningful, so each should be explored. Once every possibility is explored, curation and a statistical interrogation of the resolution are used to collapse clusters into the elemental transcriptomes of the dataset. ARBOL inherently builds a tree of subclustering events. As data is separated by the major axes of variation in each subset, further rounds capture less pronounced variables. Variation shared by all celltypes make up one of the major axes of variation in the first round of clustering, for example, inflammatory signatures. Celltypes can split up at the beginning, or the same splitting of celltypes (e.g. B and T cells) might happen further down in separate branches. Construction of a taxonomy of cell types, subtypes, and states, annotated automatically using a cell-state naming method that captures the unique markers of the end clusters among a user-defined clade, provides an understanding of relationships between the elemental transcriptomes of the dataset.
ARBOL was built to be a modular software. It iterates through the default Seurat analysis pipeline, with two additions that enable automation:
The main use case for this software is to help address the need to manually subcluster scRNA-seq data analyzed in the Seurat framework by providing tunable elements that can be consistently and unbiasedly applied to large-scale datasets. It is complementary to other packages such as scPanoView and iterclust built for specific clustering methods outside of Seurat. The ARBOL() iterative subclustering function is also available in scanpy python ARBOLpy and requires approximately half as much RAM and time to complete. All visualization methods used below are currently only available in R
Installation of ARBOL is currently done via install_github()
devtools::install_github('jo-m-lab/ARBOL',force=T)
suppressPackageStartupMessages({
#Matrix.utils should be installed automatically when you install ARBOL.
#you may have to restart your R kernel to reset the package after initial installation
#now that Matrix.utils has dropped off of CRAN, you should load it separately to ensure proper installation
library(Matrix.utils)
library(ARBOL)
library(tidyverse)
})
Warning message in system("timedatectl", intern = TRUE):
“running command 'timedatectl' had status 1”
finally, load a seurat object for ARBOL. This tutorial uses the following model dataset:
srobj <- readRDS('srobjs/polyp_srobj.rds')
dir.create('ARBOL_output/polyp_sctransform/endclusts')
Warning message in dir.create("ARBOL_output/polyp_sctransform/endclusts"):
“'ARBOL_output/polyp_sctransform/endclusts' already exists”
endclustDF <- ARBOL(srobj, cluster_assay = "SCT",
PreProcess_fun = PreProcess_sctransform,
ChooseOptimalClustering_fun = ChooseOptimalClustering_default,
saveSROBJdir='ARBOL_output/polyp_sctransform/srobjs',
figdir='ARBOL_output/polyp_sctransform/figs',
SaveEndNamesDir='ARBOL_output/polyp_sctransform/endclusts',
min_cluster_size = 100, max_tiers = 10, EnoughDiffUp = 5, EnoughDiffDown = 5,
res_scan_step = 39, res_scan_min = 0.01, res_scan_max = 3, res_scan_n = 40,
tierAllowedRecomb=2, DownsampleNum = 7500)
ARBOL() can take a long time. endclustDF provides a dataframe of tierN membership per cell, but you can also read in tiers output from the endclusts folder. You can also read the file "ARBOL_output/polyp_sctransform/endclusts.csv" to get tierN membership per cell as a dataframe.
#Oftentimes you have to replace Seurat metadata rownames after using tidyverse functions
#tidyverse throws them out because they don't like rownames. Seurat, unfortunately, uses them.
#https://hbctraining.github.io/Intro-to-R/lessons/08_intro_tidyverse.html
srobj@meta.data$CellID <- row.names(srobj@meta.data)
tiers <- ARBOL::LoadTiersAsDF('ARBOL_output/polyp_r_sctransform/endclusts')
srobj@meta.data <- srobj@meta.data %>% left_join(tiers)
row.names(srobj@meta.data) <- srobj@meta.data$CellID
Warning message: “Expected 10 pieces. Missing pieces filled with `NA` in 19196 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].” Joining, by = "CellID"
For ARBOL downstream analysis, you must provide a "sample" and "tierNident" (tierNident is included in the tiers dataframe in the previous cell) column in the Seurat object metadata
srobj@meta.data$sample <- srobj@meta.data$orig.ident
You can build and visualize a tree of clustering events using the function ARBOLcentroidTaxonomy(). Tree-building methods in ARBOL currently support categorical, count, and diversity data, adding these attributes to each node of a data.tree object and a ggraph object. These objects and metrics associated with them are added to the @misc slot of the seurat object.
srobj <- subclusteringTree(srobj, categories = c('sample','polyp','subset'),
counts = 'polyp', diversities = c('sample','polyp'), diversity_metric = 'simpson')
Warning message: “Expected 10 pieces. Missing pieces filled with `NA` in 18036 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].” Joining, by = "name"
ggraph supports diverse methods for visualization. Here, the clustering tree is visualized with heights corresponding to tier using layout='tree'. polyp status is plotted per edge to get a rough visualization at which tier clustering separates disease status.
See https://www.data-imaginist.com/2017/ggraph-introduction-layouts/ for a tutorial on ggraph.
options(repr.plot.width=16,repr.plot.height=8)
bt1 <- ggraph(srobj@misc$cluster_ggraph, layout = 'tree') +
geom_edge_elbow2(aes(color=node.polyp_majority)) + scale_edge_colour_manual(values=c('YES'='red3','NO'='blue3')) +
new_scale('color') +
geom_node_point(size=0) +
new_scale_color() +
theme_void()
bt1
ggraph allows addition of node points and text. Once we have rough celltype labels for each cell, we can plot these as text per end-cluster, and color the branches according to where these celltypes are divided in ARBOL(). Here, layout='dendrogram' allows placement of these points and text on only the "leaf" nodes. We like to plot the exact number of cells in each leaf-node alongside celltype membership, as visualizing sizes directly is often uninformative (as seen below)
options(repr.plot.width=16,repr.plot.height=8)
bt1 <- ggraph(srobj@misc$cluster_ggraph, layout = 'dendrogram') +
geom_edge_elbow2(aes(color=node.subset_majority)) +
new_scale('color') +
geom_node_point(aes(size=n)) +
geom_node_text(aes(filter = leaf, label = subset_majority, color = subset_majority), nudge_y=-0.8,vjust=0.5,hjust=0,angle=270,size=2.5) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.2,vjust=0.5,hjust=0,size=2.5,angle=270) +
theme_void() +
expand_limits(y=-5)
bt1
Instead of node sizes, we like to plot sample diversity as colored boxes per node. This gives us an idea of where clustering finds sample-specific modules
options(repr.plot.width=16,repr.plot.height=8)
clust_tree <- ggraph(srobj@misc$cluster_ggraph, layout = 'dendrogram') +
geom_edge_elbow2(aes(color=node.subset_majority)) +
new_scale('color') +
geom_node_point(size=0) +
geom_node_text(aes(filter = leaf, label = subset_majority, color = subset_majority), nudge_y=-0.8,vjust=0.5,hjust=0,angle=270,size=2.5) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.2,vjust=0.5,hjust=0,size=2.5,angle=270) +
theme_void() +
geom_node_point(aes(color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10') +
expand_limits(y=-5)
clust_tree
ggraph supports a diverse array of visualization layouts for our trees
options(repr.plot.width=10)
ggraph(srobj@misc$cluster_ggraph, 'treemap', weight = n) +
geom_node_tile(aes(fill = subset_majority, filter = leaf, alpha = depth), colour = NA) +
# geom_node_text(aes(filter = leaf,label=tierNident_majority)) +
geom_node_tile(aes(size = depth), colour = 'white') +
geom_node_tile(aes(filter = depth < 2)) +
scale_alpha(range = c(1, 0.5)) +
scale_size(range = c(4, 0.2), guide = 'none') + theme_classic()
Non-leaf weights ignored
options(repr.plot.width=16,repr.plot.height=6.75)
ggraph(srobj@misc$cluster_ggraph, layout='tree', circular=T) + geom_edge_diagonal2(aes(color=node.subset_majority)) + theme_void() +
ggraph(srobj@misc$cluster_ggraph, layout='circlepack', circular=T) +
geom_edge_diagonal2(aes(color=node.subset_majority),strength=1) +
geom_node_point(aes(color=subset_majority)) +
theme_classic()
options(repr.plot.width=16,repr.plot.height=4)
ggraph(srobj@misc$cluster_ggraph, layout='tree') + geom_edge_diagonal() + theme_void()
ggraph(srobj@misc$cluster_ggraph, layout='tree') + geom_edge_fan2(aes(color=node.subset_majority)) + theme_void()
In analysis of ARBOL tree-structure, we have found that clustering trees are not reflective of cluster relationships. To build a tree that reflects these relationships, we calculate distances between each clusters' feature-centroids in unreduced gene space or in a latent space.
ARBOL tree-building functions calculate a taxonomy of end-cluster relationships using the pvclust() package. Tree-building is wrapped in the function ARBOLcentroidTaxonomy(), which enables allotment of the same classes of variables as used above to the tree's nodes.
This function allows user-input of most pvclust parameters, including hclust method and distance methods, as well as parameters to determine what data to input to the tree calculation. You can specify any latent space in the Embeddings() of your seurat object or you can calculate centroids per cluster of a subset of genes, using the parameters tree_reduction='centroids' and gene_list. gene_list defaults to using all genes. Binary tree calculation is much faster if performed on a dimensionality-reduced space - especially useful for large datasets
srobj <- suppressMessages(RunPCA(srobj))
library(tictoc)
tic()
srobj <- suppressWarnings(suppressMessages(ARBOLcentroidTaxonomy(srobj, categories = c('polyp','subset'), diversities = c('sample','polyp'),
counts = c('polyp','subset'),
tree_reduction='pca',centroid_method = 'mean', distance_method = cosine,
hclust_method='complete',nboot=100)))
toc()
Bootstrap (r = 0.48)... Done. Bootstrap (r = 0.6)... Done. Bootstrap (r = 0.68)... Done. Bootstrap (r = 0.8)... Done. Bootstrap (r = 0.88)... Done. Bootstrap (r = 1.0)... Done. Bootstrap (r = 1.08)... Done. Bootstrap (r = 1.2)... Done. Bootstrap (r = 1.28)... Done. Bootstrap (r = 1.4)... Done. 142.799 sec elapsed
A basic visualization is added to the @misc slot of the seurat object.
options(repr.plot.width=16,repr.plot.height=6)
srobj@misc$taxViz
Using the same attributes as in the cluster trees in the previous section, plotting the newly calculated taxonomy shows celltypes grouped closer together
options(repr.plot.width=16,repr.plot.height=4)
clust_tree + ggtitle('clustering tree') + theme(legend.position='None')
tax_tree <- ggraph(srobj@misc$tax_ggraph, layout = 'dendrogram', height=plotHeight*3) +
geom_edge_elbow2(aes(color=node.subset_majority)) +
new_scale('color') +
geom_node_point(size=0) +
geom_node_text(aes(filter = leaf, label = subset_majority, color = subset_majority), nudge_y=-1.5,vjust=0.5,hjust=0,angle=270,size=2.5) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.6,vjust=0.5,hjust=0,size=2.5,angle=270) +
theme_void() +
geom_node_point(aes(color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10') +
expand_limits(y=-3) + theme(legend.position='None') + ggtitle('ARBOL taxonomy')
tax_tree
Warning message: “Existing variables `leaf` overwritten by layout variables”
pvclust bootstrapped p-values for branching points are included in the ggraph object, and can be directly visualized on the ggraph (we recommend filtering the lower splits out). Good to keep in mind here that pvclust bootstraps on features (which are our gene-centroids per cluster in this case), which we find not as reflective of certainty as would be a bootstrap of samples
options(repr.plot.width=12,repr.plot.height=6)
library(ggforce)
#add a filter to the ggraph to remove lower splits from labeling (to avoid clutter)
srobj@misc$tax_ggraph <- srobj@misc$tax_ggraph %>% activate(nodes) %>%
mutate(swoosh_alpha=ifelse(pval > 0.1,abs(log(pval)),1),sig=format(round(pval),nsmall=3),
label.alpha = ifelse(is.na(pval),0,1)) %>% mutate(sig = ifelse(is.na(pval) | plotHeight < 0.1,'',sig))
#use a swoosh-type tree with transparency scaled to pval to visualize tree
ggraph(srobj@misc$tax_ggraph, layout='dendrogram', height=plotHeight*10) +
geom_edge_diagonal2(aes(label=node.sig,color=node.subset_majority,alpha=node.swoosh_alpha),label_dodge=unit(-2,'mm'),angle_calc='along') + theme_classic() +
coord_flip() + scale_y_reverse() + ggtitle('PCA mean-centroids cosine distance pvclust bootstrap p-value')
Warning message: “Existing variables `leaf` overwritten by layout variables”
When we calculate the taxonomy using median centroids of all genes, centroid-bootstrap finds higher p-values for many branching points, indicating instability in some T and apical celltype clade placement
tic()
srobj <- suppressWarnings(suppressMessages(ARBOLcentroidTaxonomy(srobj, categories = c('tier1','polyp','subset'), diversities = c('sample','polyp'),
counts = c('polyp','subset'),
tree_reduction='centroids',centroid_method = 'median', distance_method = cosine,
hclust_method='complete')))
toc()
Bootstrap (r = 0.5)... Done. Bootstrap (r = 0.6)... Done. Bootstrap (r = 0.7)... Done. Bootstrap (r = 0.8)... Done. Bootstrap (r = 0.9)... Done. Bootstrap (r = 1.0)... Done. Bootstrap (r = 1.1)... Done. Bootstrap (r = 1.2)... Done. Bootstrap (r = 1.3)... Done. Bootstrap (r = 1.4)... Done. 66.166 sec elapsed
options(repr.plot.width=12,repr.plot.height=6)
#filter out lower splits from labeling
srobj@misc$tax_ggraph <- srobj@misc$tax_ggraph %>% activate(nodes) %>%
mutate(swoosh_alpha=ifelse(pval > 0.1,abs(log(pval)),1),sig=format(round(pval),nsmall=3),
label.alpha = ifelse(is.na(pval),0,1)) %>% mutate(sig = ifelse(is.na(pval) | plotHeight < 0.1,'',sig))
#use a swoosh-type tree with transparency scaled to pval to visualize tree
ggraph(srobj@misc$tax_ggraph, layout='dendrogram', height=plotHeight*10) +
geom_edge_diagonal2(aes(label=node.sig,color=node.subset_majority,alpha=node.swoosh_alpha),label_dodge=unit(-2,'mm'),angle_calc='along') + theme_classic() +
coord_flip() + scale_y_reverse() + ggtitle('all-gene median-centroids cosine distance pvclust bootstrap p-value')
Warning message: “Existing variables `leaf` overwritten by layout variables”
We hope to enable ARBOL users to define their own taxonomy calculation methods.
Please reach out with suggestions for tree-building methods that may not be included in pvclust()
In addition to taxonomy layouts we've used so far, ggraph supports advanced ggplot functions via ggforce and ggbuild. Here we showcase additional methods for interrogating the tree, including methods for superimposing geoms to ggraph builds, alongside ggforce's zoom and marking
library(ggforce)
library(scatterpie)
library(RColorBrewer)
Using pieify_tree_plot() you can add pie charts to a ggraph tree using scatterpie hacked to attach to a tree with a geom_node_point() layer. Add a geom_node_point() to your ggraph plot and then run pieify_tree_plot() on the resulting plot to add pie charts to either taxonomy or subclustering trees
options(repr.plot.width=16,repr.plot.height=12)
#multiply plot height by 150 to scale pie charts better
treeplot <- ggraph(srobj@misc$tax_ggraph, layout='dendrogram', height=plotHeight*150) +
geom_edge_elbow2(aes(color=node.subset_majority),edge_width=1) +
scale_edge_color_brewer(palette='Spectral') +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = subset_majority %>% str_replace_all('_','/'), color = subset_majority),
nudge_y=-4,vjust=0.5,hjust=0,angle=270,size=8) +
scale_color_brewer(palette='Spectral') +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.6,vjust=0.5,hjust=0,size=6,angle=270) +
theme_classic() +
geom_node_point(aes(filter = leaf, color=sample_diversity),size=4,shape='square') +
scale_color_gradient(low='grey90',high='grey10') +
expand_limits(y=-10) +
# coord_cartesian(xlim = c(40,75)) +
#add an invisible point to each node for use in ggbuild hacking done by pieify_tree_plot()
geom_node_point(aes(label=string),size=0)
#when using pieify_tree_plot in taxonomy mode, we need to set scaling_factor to 1
treeplot <- pieify_tree_plot(ggraph_plot = treeplot, srobj=srobj, color_metadata='subset',
mode='taxonomy', y_cutoff=10, radius=2, scaling_factor=1) + scale_fill_brewer(palette='Spectral')
treeplot
Warning message:
“Existing variables `leaf` overwritten by layout variables”
Warning message in geom_node_point(aes(label = string), size = 0):
“Ignoring unknown aesthetics: label”
Joining, by = "string"
options(repr.plot.width=16,repr.plot.height=8)
#in subclusters mode, scaling factor should be input to the pieify_tree_plot function instead
scaling_factor = 20
treeplot <- ggraph(srobj@misc$cluster_ggraph, layout = 'dendrogram') +
geom_edge_elbow2(aes(color=node.polyp_majority,y=y*scaling_factor)) + scale_edge_color_manual(values=c('YES'='red3','NO'='blue3')) +
geom_node_text(aes(filter = leaf, label = name, color = subset_majority), nudge_y=-1,vjust=0.5,hjust=0,angle=270,size=3) +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.2,vjust=0.5,hjust=0,size=3,angle=270) +
theme_classic() +
new_scale_color() +
geom_node_point(aes(filter = leaf,color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10') +
expand_limits(y=(-1*scaling_factor)) +
#add an invisible point to each node for use in ggbuild hacking - put it at the end
geom_node_point(aes(label=name),size=0)
#pieify_tree_plot can use an additional parameter - scaling_factor
#since ggraph widths are dependent on leaves and height in subclustering is 1 per tier,
#scaling factor scales up heights so the pies aren't stretched. 20 is a good default.
treeplot <- pieify_tree_plot(ggraph_plot=treeplot, srobj=srobj, color_metadata='polyp',
mode='subclusters', scaling_factor=scaling_factor, radius=3) +
scale_fill_manual(values=c('polyp_n_YES'='red3','polyp_n_NO'='blue3'))
treeplot
Warning message in geom_node_point(aes(label = name), size = 0):
“Ignoring unknown aesthetics: label”
Joining, by = "name"
ggforce also enables users to create a multi-layer graph with a zoomed in portion using facet_zoom()
#which colors do we want to hack into our ggraph? Here, I use rough celltype aka "subset"
#I use colorRamps to get highly contrastive colors
suppressPackageStartupMessages(library(colorRamps))
celltype.colors = rev(primary.colors(n=length(srobj@meta.data$subset %>% unique)))
b <- ggraph(srobj@misc$tax_ggraph, layout='dendrogram', height=plotHeight*10) +
geom_edge_elbow2(aes(color=node.subset_majority)) + scale_edge_color_manual(values=celltype.colors) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = subset_majority %>% str_replace_all('_','/'), color = subset_majority), nudge_y=-2,vjust=0.5,hjust=0,angle=270,size=2) +
scale_color_manual(values=celltype.colors) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.4,vjust=0.5,hjust=0,size=3,angle=270) +
theme_classic() +
geom_node_point(aes(filter = leaf, color=sample_diversity),size=4,shape='square') +
scale_color_gradient(low='grey90',high='grey10') +
expand_limits(y=-5) + facet_zoom(xlim = c(35,70),ylim=c(-10,5),horizontal=F,zoom.size=1)
pb <- ggplot_build(b)
#facet_zoom creates a PANEL column in ggbuild, for which PANEL 4 is the zoomed-in facet.
#Using PANEL==4 restricted mutates I can control the size and placement of ggplot aspects only within the zoomed-in
#part of the plot
#plot 1 is the edges of the dendrogram
pb$data[[1]] <- pb$data[[1]] %>% mutate(edge_width=ifelse(PANEL==4,2,edge_width))
#plot 2 is the label geom_node_text
pb$data[[2]] <- pb$data[[2]] %>% mutate(size=ifelse(PANEL==4,7,size),y=ifelse(PANEL==4,y-1.5,y))
#plot 3 is the cell number geom_node_text
pb$data[[3]] <- pb$data[[3]] %>% mutate(size=ifelse(PANEL==4,7,size),y=ifelse(PANEL==4,y-0.5,y))
#plot 4 is the diversity geom_node_point - no need to move this one down
pb$data[[4]] <- pb$data[[4]] %>% mutate(size=ifelse(PANEL==4,7,size))
plot(ggplot_gtable(pb))
Warning message: “Existing variables `leaf` overwritten by layout variables”
It's also possible to zoom using coord_cartesian() to crop the graph rather than producing a double graphic.
Both of these methods can be hacked to contain geom_scatterpie() as well
cc <- ggraph(srobj@misc$tax_ggraph, layout='dendrogram', height=plotHeight*10) +
geom_edge_elbow2(aes(color=node.subset_majority),edge_width=1) + scale_edge_color_manual(values=celltype.colors) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = subset_majority, color = subset_majority),
nudge_y=-1.8,vjust=0.5,hjust=0,angle=270,size=8) +
scale_color_manual(values=celltype.colors) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.2,vjust=0.5,hjust=0,size=8,angle=270) +
theme_classic() +
geom_node_point(aes(filter = leaf, color=sample_diversity),size=4,shape='square') +
scale_color_gradient(low='grey90',high='grey10') +
expand_limits(y=-5) +
coord_cartesian(xlim=c(25,60),ylim=c(-5,10)) +
geom_node_point(aes(label=string),size=0)
pieify_tree_plot(cc, srobj, color_metadata='subset',y_cutoff=0.8,radius=0.4,mode='taxonomy')
Warning message:
“Existing variables `leaf` overwritten by layout variables”
Warning message in geom_node_point(aes(label = string), size = 0):
“Ignoring unknown aesthetics: label”
Joining, by = "string"
taxonomy building metrics are saved in the @misc slot in case you need them later
#but don't print the gene_list because that defaults to all 20k+ genes!
srobj@misc$tree_metrics[names(srobj@misc$tree_metrics)!='gene_list']
function (x)
{
x <- as.matrix(x)
y <- t(x) %*% x
res <- 1 - y/(sqrt(diag(y)) %*% t(sqrt(diag(y))))
res <- as.dist(res)
attr(res, "method") <- "cosine"
return(res)
}ARBOL provides methods for pruning the tree based on any of the variables included in its calculation. These methods merge clusters up into their branch in the tree. Although ARBOL() tiered hierarchical clustering attempts to find the right resolution for clustering analysis, different datasets and analyses call for different resolutions.
For example, with ARBOL you can merge based on a sample diversity threshold, in case end clusters are dominated by sample-specific effects.
There are two functions for merging clusters - the first is more stable than the second but is restricted to only cell number per cluster (size_threshold) and sample diversity (sample_diversity_threshold)
Data.tree objects have a strange property of only existing in one place in an object, so we have to manually save a copy of the tree outside this environment to be able to modulate the tree without losing the old one
#save raw tree for tree modulation
saveRDS(srobj@misc$taxTree,'raw_polyp_taxonomy.rds')
srobj <- mergeEndclusts(srobj, sample_diversity_threshold = 0.2, size_threshold = 100)
Joining, by = "CellID"
After merging clusters, a new tierNident column is placed in the seurat object's metadata. At this point we must re-calculate the tree to make sure nodes that have unique cells are called end-clusters.
srobj <- suppressWarnings(suppressMessages(ARBOLcentroidTaxonomy(srobj, categories = c('polyp','subset'),
diversities = c('sample','polyp'),
counts = c('polyp','subset'),
tree_reduction='pca',centroid_method = 'mean',
distance_method = cosine, hclust_method='complete')))
Bootstrap (r = 0.48)... Done. Bootstrap (r = 0.6)... Done. Bootstrap (r = 0.68)... Done. Bootstrap (r = 0.8)... Done. Bootstrap (r = 0.88)... Done. Bootstrap (r = 1.0)... Done. Bootstrap (r = 1.08)... Done. Bootstrap (r = 1.2)... Done. Bootstrap (r = 1.28)... Done. Bootstrap (r = 1.4)... Done.
options(repr.plot.width=10,repr.plot.height=5)
ggraph(srobj@misc$tax_ggraph, layout = 'dendrogram', height=plotHeight*10) +
geom_edge_elbow2(aes(color=node.polyp_majority)) + scale_edge_color_manual(values=c('YES'='red3','NO'='blue3')) +
new_scale('color') +
geom_node_text(aes(filter = leaf, label = subset_majority, color = subset_majority), nudge_y=-2.5,vjust=0.5,hjust=0,angle=270,size=5) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.5,vjust=0.5,hjust=0,size=4,angle=270) +
theme_void() +
geom_node_point(aes(filter = leaf,color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10',limits=c(0,1)) +
expand_limits(y=-8)
Warning message: “Existing variables `leaf` overwritten by layout variables”
If you're unsatisfied with the merge result, reset the seurat object and merge again with different parameters
rawTree <- readRDS('raw_polyp_taxonomy.rds')
srobj@misc$taxTree <- rawTree
#replace tierNident with rawIdent and remove rawIdent to allow new merge
srobj@meta.data$tierNident <- srobj@meta.data$rawIdent
srobj@meta.data <- srobj@meta.data %>% dplyr::select(-rawIdent)
srobj <- mergeEndclusts(srobj, sample_diversity_threshold = 0, size_threshold=50)
Joining, by = "CellID"
srobj <- suppressWarnings(suppressMessages(ARBOLcentroidTaxonomy(srobj, categories = c('polyp','subset'), diversities = c('sample','polyp'),
counts = c('polyp','subset'),
tree_reduction='pca',centroid_method = 'mean', distance_method = cosine,
hclust_method='complete')))
Bootstrap (r = 0.48)... Done. Bootstrap (r = 0.6)... Done. Bootstrap (r = 0.68)... Done. Bootstrap (r = 0.8)... Done. Bootstrap (r = 0.88)... Done. Bootstrap (r = 1.0)... Done. Bootstrap (r = 1.08)... Done. Bootstrap (r = 1.2)... Done. Bootstrap (r = 1.28)... Done. Bootstrap (r = 1.4)... Done.
options(repr.plot.width=16,repr.plot.height=5)
ggraph(srobj@misc$tax_ggraph, layout = 'dendrogram', height=plotHeight*10) +
geom_edge_elbow2(aes(color=node.subset_majority)) + #scale_edge_color_manual(values=c('YES'='red3','NO'='blue3')) +
new_scale('color') +
geom_node_text(aes(filter = leaf, label = subset_majority, color = subset_majority), nudge_y=-2.5,vjust=0.5,hjust=0,angle=270,size=5) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.5,vjust=0.5,hjust=0,size=4,angle=270) +
theme_void() +
geom_node_point(aes(filter = leaf,color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10',limits=c(0,1)) +
expand_limits(y=-8)
Warning message: “Existing variables `leaf` overwritten by layout variables”
Finally, it's possible to merge clusters using a cutree() like algorithm, where you choose the number of clusters you want and the tree is cut horizontally to produce that number of clusters, as displayed below. To do so, add a rank to each internal node of the tree and merge based on a boolean of the rank
options(repr.plot.width=16,repr.plot.height=5)
precut <- ggraph(srobj@misc$tax_ggraph, layout = 'dendrogram', height=plotHeight*10) +
geom_edge_elbow2(aes(color=node.subset_majority)) + #scale_edge_color_manual(values=c('YES'='red3','NO'='blue3')) +
new_scale('color') +
geom_node_text(aes(filter = leaf, label = subset_majority, color = subset_majority), nudge_y=-2.5,vjust=0.5,hjust=0,angle=270,size=5) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.5,vjust=0.5,hjust=0,size=4,angle=270) +
theme_void() +
geom_node_point(aes(filter = leaf,color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10',limits=c(0,1)) +
expand_limits(y=-8) + geom_hline(yintercept=1, color='red3')
precut
Warning message: “Existing variables `leaf` overwritten by layout variables”
rawTree <- readRDS('raw_polyp_taxonomy.rds')
srobj@misc$taxTree <- rawTree
#replace tierNident with rawIdent and remove rawIdent to allow new merge
srobj@meta.data$tierNident <- srobj@meta.data$rawIdent
srobj@meta.data <- srobj@meta.data %>% dplyr::select(-rawIdent)
#add plot height rank k per node
heights <- data.frame(heights=srobj@misc$taxTree$Get(function(node) node$parent$plotHeight),
ranks=rank(srobj@misc$taxTree$Get(function(node) node$parent$plotHeight)) %>% unname)
heights$heights[1] <- 1
srobj@misc$taxTree$Do(function(node) node$k <- heights[match(node$parent$plotHeight,heights$heights),]$rank)
srobj@misc$taxTree$k <- max(unlist(srobj@misc$taxTree$Get('k')))+1
#calculate proper threshold for 50 clusters
cut = 50
threshold <- max(unlist(srobj@misc$taxTree$Get('k')))-(cut*2-1)
#ARBOL has a second mergeEndclusts function that enables users to provide their own metadata for merge thresholds
#threshold_attributes reads in a list of metadata columns on which to merge
#tresholds is the list of thresholds on which to merge the threshold attributes, in the same order as threshold_attributes
#this function tends to be unstable so use of a single treshold_attribute at a time is preferred
#(or use the previous function above)
#the output of this function places the new tree
#cut the taxonomy according to node rank
srobj <- mergeEndclustsCustom(srobj, threshold_attributes = 'k', thresholds = threshold)
#make sure you have the number of clusters you want
srobj@meta.data$tierNident %>% unique %>% length
Joining, by = "CellID"
srobj <- suppressWarnings(suppressMessages(ARBOLcentroidTaxonomy(srobj, categories = c('polyp','subset'), diversities = c('sample','polyp'),
counts = c('polyp','subset'),
tree_reduction='pca',centroid_method = 'mean', distance_method = cosine,
hclust_method='complete')))
Bootstrap (r = 0.48)... Done. Bootstrap (r = 0.6)... Done. Bootstrap (r = 0.68)... Done. Bootstrap (r = 0.8)... Done. Bootstrap (r = 0.88)... Done. Bootstrap (r = 1.0)... Done. Bootstrap (r = 1.08)... Done. Bootstrap (r = 1.2)... Done. Bootstrap (r = 1.28)... Done. Bootstrap (r = 1.4)... Done.
options(repr.plot.width=12,repr.plot.height=5)
aftercut <- ggraph(srobj@misc$tax_ggraph, layout = 'dendrogram', height=plotHeight*10) +
geom_edge_elbow2(aes(color=node.polyp_majority)) + scale_edge_color_manual(values=c('YES'='red3','NO'='blue3')) +
new_scale('color') +
geom_node_text(aes(filter = leaf, label = subset_majority, color = subset_majority), nudge_y=-3,vjust=0.5,hjust=0,angle=270,size=5) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=-0.5,vjust=0.5,hjust=0,size=4,angle=270) +
theme_void() +
geom_node_point(aes(filter = leaf,color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10',limits=c(0,1)) +
expand_limits(y=-8)
Warning message: “Existing variables `leaf` overwritten by layout variables”
precut
aftercut
ARBOL's naming function provides users with an automatic method to name sub-clusters. The most important parameter, "celltype_col" tells the function which clade to perform 1 vs. rest Wilcoxon for each endclust. The genes a standard name is based on will be those that make it unique among the clade specified.
figdir allows you to save the wilcoxon results. If there is a getStandardNames() run in the target figdir already, it will re-load that result instead of calculating again.
max_cells_per_ident can speed up wilcoxon calculation substantially. We have noted stability in multiple runs of getStandardNames() even when only using 100 cells per ident. It's also possible to speed up the wilcoxon by calculating names among finer-grained celltype_col clades. If there are fewer clusters per clade, wilcoxon is faster.
finally, standardname_col is the seurat object metadata column into which to place the resulting standard names
There is a catch here. If some cell states have multiple celltype_col membership, then we need to create a different name for that cell state based on each celltype in which it has membership. When this happens, GetStandardNames() will create a separate column in the metadata for each possible celltype's standard name. The standardname_col you specify in the parameter will be filled with the majority celltype-name for that cell state
# remember to check your seurat object metadata for rownames -
# seurat needs them but tidyverse removes them
row.names(srobj@meta.data) <- srobj@meta.data$CellID
#reminder: getStandardNames saves output wilcoxon to speed up re-runs of the pipeline.
#if we want to recalculate the names and use the same directory to save, we need to remove the old results.
system('rm -r polyp_cutThreshold_testing')
dir.create('polyp_cutThreshold_testing')
srobj <- getStandardNames(srobj, figdir = 'polyp_cutThreshold_testing',
celltype_col = 'subset', max_cells_per_ident=100,
standardname_col = 'standardName')
Calculating cluster Root.2.2.2.2.1.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.2.2.2.2.1.2.2
Calculating cluster Root.1.2.2.2.2.2.1
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.2.2.1.2.2.1.1
Calculating cluster Root.2.2.2.2.1.1.2
Calculating cluster Root.2.2.2.1.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.1
Calculating cluster Root.2.2.2.2.2.2.1.2
Calculating cluster Root.2.1.2.T1C1.T2C8
Calculating cluster Root.2.2.1.2.2.2.2.1
Calculating cluster Root.2.2.2.1.2.2.2.2
Calculating cluster Root.1.2.2.2.2.2.2
Calculating cluster Root.1.2.1.2.1
Calculating cluster Root.1.2.1.2.2.1
Calculating cluster Root.2.2.1.2.2.1.2.1
Calculating cluster Root.2.2.2.2.2.2.2.2.2
Calculating cluster Root.2.2.1.2.2.2.2.2.2.1
Calculating cluster Root.2.2.1.2.2.1.1
Calculating cluster Root.2.2.1.2.2.2.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.1
Calculating cluster Root.2.2.2.1.2.2.2.2
Calculating cluster Root.2.2.1.2.2.1.2.1
Calculating cluster Root.2.2.2.1.1.2.2
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.2.2.2.1.2.1
Calculating cluster Root.2.2.1.2.2.2.2.2.2.2
Calculating cluster Root.1.2.2.2.2.2.2
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.2.2.2.1.2.2.T1C0.T2C4.T3C8
Calculating cluster Root.2.2.2.1.1.1
Calculating cluster Root.2.2.2.1.2.2.2.T1C7.T2C8
Calculating cluster Root.2.2.1.2.2.2.2.2.1
Calculating cluster Root.2.2.1.2.2.2.1
Calculating cluster Root.2.2.1.2.1.2
Calculating cluster Root.2.2.2.1.1.2.T1C0.T2C4.T3C4
Calculating cluster Root.2.1.2.2.1
Calculating cluster Root.2.1.2.2.2.2.1
Calculating cluster Root.2.2.1.1.2
Calculating cluster Root.2.2.1.1.1
Calculating cluster Root.2.2.2.1.2.1
Calculating cluster Root.2.2.1.2.2.1.1
Calculating cluster Root.2.2.2.1.2.2.2.2
Calculating cluster Root.2.1.2.2.1
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.2.2.1.2.2.2.2.2.1
Calculating cluster Root.2.2.1.2.2.2.2.2.2.1
Calculating cluster Root.1.2.2.1.2.2.2
Calculating cluster Root.2.1.2.2.2.2.2
Calculating cluster Root.2.1.2.2.2.2.1
Calculating cluster Root.2.2.1.2.2.2.2.2.2.2
Calculating cluster Root.2.2.1.2.2.2.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.2.1.2.2.2.T1C1.T2C9
Calculating cluster Root.1.2.1.2.1
Calculating cluster Root.2.2.2.2.2.2.2.2.2
Calculating cluster Root.1.1
Calculating cluster Root.1.2.2.1.1
Calculating cluster Root.1.2.1.2.2.1
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.1.2.2.1.2.2.1
Calculating cluster Root.1.2.2.1.2.1
Calculating cluster Root.1.2.1.2.2.2
Calculating cluster Root.1.2.2.1.2.2.2
Calculating cluster Root.1.2.2.2.2.T1C3.T2C0.T3C9
Calculating cluster Root.1.2.1.2.1
Calculating cluster Root.2.2.1.2.2.1.1
Calculating cluster Root.1.2.2.2.2.2.2
Calculating cluster Root.1.2.1.1
Calculating cluster Root.2.2.1.2.2.1.2.2.1
Calculating cluster Root.2.2.2.1.2.2.2.2
Calculating cluster Root.2.2.2.1.2.1
Calculating cluster Root.2.1.2.2.2.T1C1.T2C9
Calculating cluster Root.2.2.2.1.1.2.2
Calculating cluster Root.1.2.2.2.2.2.1
Calculating cluster Root.2.2.2.2.1.1.2
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.2.2.2.2.1.1.1
Calculating cluster Root.2.2.1.2.2.2.2.2.2.1
Calculating cluster Root.2.2.1.2.2.1.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.2.2.1.2.2.1.2.2.1
Calculating cluster Root.2.2.2.1.1.2.2
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.2.2.1.2.2.1.1
Calculating cluster Root.2.2.2.1.2.2.2.2
Calculating cluster Root.2.2.2.1.2.2.T1C0.T2C4.T3C8
Calculating cluster Root.2.2.2.1.2.1
Calculating cluster Root.2.2.2.1.1.2.T1C0.T2C4.T3C4
Calculating cluster Root.1.2.2.2.2.2.2
Calculating cluster Root.2.2.1.2.2.2.2.2.2.2
Calculating cluster Root.2.2.2.1.1.1
Calculating cluster Root.2.1.2.2.2.2.1
Calculating cluster Root.2.2.2.2.1.2.2
Calculating cluster Root.2.2.2.2.1.1.2
Calculating cluster Root.2.2.1.2.2.2.2.1
Calculating cluster Root.2.1.2.2.2.T1C1.T2C9
Calculating cluster Root.2.2.1.2.2.2.2.1
Calculating cluster Root.2.1.2.2.1
Calculating cluster Root.2.2.1.2.1.1
Calculating cluster Root.2.2.1.2.2.2.2.2.2.2
Calculating cluster Root.2.2.1.2.2.1.1
Calculating cluster Root.2.1.2.2.2.2.1
Calculating cluster Root.2.1.1
Calculating cluster Root.2.2.2.1.2.2.2.2
Calculating cluster Root.2.1.2.2.2.2.2
Calculating cluster Root.2.2.1.2.1.2
Calculating cluster Root.2.2.2.1.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.2.2.1.2.2.2.2.2.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.1
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.2.2.1.2.2.2.2.2.1
Calculating cluster Root.2.2.2.1.1.1
Calculating cluster Root.2.1.2.T1C1.T2C8
Calculating cluster Root.2.2.1.2.2.2.1
Calculating cluster Root.2.2.2.1.2.2.T1C0.T2C4.T3C8
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.2.2.2.2.2.1.1
Calculating cluster Root.2.2.2.2.2.1.2
Calculating cluster Root.2.2.2.2.2.2.1.T1C7.T2C9
Calculating cluster Root.2.2.2.2.2.2.2.2.2
Calculating cluster Root.2.2.2.1.2.1
Calculating cluster Root.2.2.2.2.2.2.2.T1C7.T2C11
Calculating cluster Root.2.2.1.2.2.2.2.2.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.2.1.2.2.2.2.1
Calculating cluster Root.1.2.2.2.2.T1C3.T2C0.T3C9
Calculating cluster Root.1.2.2.1.1
Calculating cluster Root.2.2.2.1.2.2.2.T1C7.T2C8
Calculating cluster Root.1.2.2.1.2.2.1
Calculating cluster Root.1.2.2.2.2.2.2
Calculating cluster Root.2.2.2.2.1.1.2
Calculating cluster Root.2.2.2.2.1.1.1
Calculating cluster Root.2.2.2.2.1.2.1
Calculating cluster Root.2.2.2.2.1.2.2
Calculating cluster Root.2.2.2.2.1.1.2
Calculating cluster Root.2.2.1.2.2.1.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.2.2.2.1.2.2.2.2
Calculating cluster Root.2.2.1.2.2.2.2.2.2.1
Calculating cluster Root.2.2.2.1.1.2.2
Calculating cluster Root.2.2.2.1.1.1
Calculating cluster Root.2.2.1.2.2.2.2.2.2.2
Calculating cluster Root.2.2.2.1.2.1
Calculating cluster Root.2.2.2.2.2.2.2.2.2
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.1
Calculating cluster Root.2.1.2.T1C1.T2C8
Calculating cluster Root.2.2.1.2.1.2
Calculating cluster Root.2.2.1.2.2.2.2.1
Calculating cluster Root.1.2.2.2.2.2.1
Calculating cluster Root.2.2.2.2.2.1.2
Calculating cluster Root.2.2.1.2.2.1.1
Calculating cluster Root.2.2.2.2.2.2.1.2
Calculating cluster Root.2.2.2.2.2.2.1.T1C7.T2C9
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.1.2.1.2.1
Calculating cluster Root.1.2.1.2.2.1
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.1.2.2.2.2.T1C3.T2C0.T3C9
Calculating cluster Root.1.2.2.1.2.2.1
Calculating cluster Root.2.2.1.2.2.1.1
Calculating cluster Root.1.2.1.2.2.2
Calculating cluster Root.1.2.2.2.2.2.2
Calculating cluster Root.2.2.2.2.2.2.2.2.2
Calculating cluster Root.2.2.2.2.2.2.2.2.T1C7.T2C10
Calculating cluster Root.2.2.2.1.2.2.2.T1C7.T2C8
Calculating cluster Root.2.2.1.2.2.1.2.2.1
Calculating cluster Root.2.2.1.2.2.1.2.2.2
Calculating cluster Root.2.2.2.1.2.2.2.2
Calculating cluster Root.2.2.2.1.2.1
Calculating cluster Root.2.2.1.2.2.1.1
Calculating cluster Root.2.2.2.2.2.2.2.T1C7.T2C11
Calculating cluster Root.2.2.2.1.1.1
Calculating cluster Root.1.2.1.2.1
Calculating cluster Root.2.2.1.1.1
Calculating cluster Root.2.1.2.2.1
Calculating cluster Root.2.1.2.2.2.2.1
Calculating cluster Root.2.2.2.2.1.1.2
Calculating cluster Root.1.2.2.2.2.2.2
Calculating cluster Root.2.2.2.2.2.2.1.T1C7.T2C9
Calculating cluster Root.1.2.2.2.1
Calculating cluster Root.1.2.2.1.2.2.1
Calculating cluster Root.1.2.2.1.2.1
Calculating cluster Root.2.2.2.2.1.2.1
Calculating cluster Root.1.2.1.2.2.2
number of tierNident-celltype pairs for which at least 2 biomarkers were found: 115
total number of end clusters: 50
Joining, by = c("tierNident", "standardName")
#now that standard names have been applied to cell states,
#if you have multiple celltypes per cell state, it's usually best to decide on a single celltype per cellstate,
#so let's look quickly at the names and change those that need changing
#the quickest way to pick a type for each state is to take the majority off the front of the standard name
#gather all standard name - celltype pairs for which there are multiple celltypes
#count how many subsets there are per standardName
change <- srobj@meta.data %>% dplyr::count(standardName, subset) %>%
#count the number of
dplyr::count(standardName) %>% dplyr::arrange(-n) %>% dplyr::group_by(standardName) %>%
dplyr::slice_max(order_by = n, n=1)
#pull majority celltype name (let's call it "subset" since that's the original celltype column)
change$subset <- sub('\\..*$','',change$standardName)
#remove celltype from seurat object and replace with standard name majority choice
srobj@meta.data <- srobj@meta.data %>% dplyr::select(-subset) %>% left_join(change %>% dplyr::select(-n))
Joining, by = "standardName"
srobj <- suppressWarnings(suppressMessages(remake_ggraph(srobj,
categories = c('polyp','standardName','subset'),
diversities = c('standardName','subset','polyp'),
counts=c('polyp','subset'))))
#let's rotate the graph this time to make it easier to read the new names
options(repr.plot.width=16,repr.plot.height=9)
ggraph(srobj@misc$tax_ggraph, layout = 'dendrogram', height=plotHeight*10) +
geom_edge_elbow2(aes(color=node.polyp_majority)) + scale_edge_color_manual(values=c('YES'='red3','NO'='blue3')) +
new_scale('color') +
geom_node_text(aes(filter = leaf, label = standardName_majority, color = subset_majority), nudge_y=2,vjust=0.5,hjust=0,size=5) +
new_scale_color() +
geom_node_text(aes(filter = leaf, label = n),color='grey30',nudge_y=0.5,vjust=0.5,hjust=0,size=4) +
theme_void() +
geom_node_point(aes(filter = leaf,color=sample_diversity),size=4,shape='square') + scale_color_gradient(low='grey90',high='grey10',limits=c(0,1)) +
coord_flip() + scale_y_reverse() +
expand_limits(y=-8)
Scale for y is already present. Adding another scale for y, which will replace the existing scale.
#calculate composition per sample
OTUlong <- srobj@meta.data %>%
#below here is a filtering step that I often do to include or
#exclude branches of the tree based on cell type, or if they're cells that I don't want to include.
#Often the difference between conditions can be diluted if you include certain celltypes,
filter(subset %in% c(
'Fibroblast',
'TCell',
'Basal',
'Myeloid',
'PlasmaCell',
'Ciliated',
'Apical',
'Glandular')) %>%
#make sure to have a sample column that includes the clinical metadata you want to interrogate
mutate(sample = paste(sample,polyp,sep='_')) %>%
#now the counting step is first sample, then tier1 any divisions you want to calculate, other wise just immediately follow with tierNident
dplyr::count(sample,subset,standardName) %>%
#now group by which tier we want to normalize by, here overall
group_by(sample) %>%
#count by cell count per sample
mutate(CPMtotal=n/sum(n)*1e4,nCells=sum(n)) %>%
#ungroup to count in other ways
ungroup %>%
#group now by sample to calculate composition of cell states per patient
group_by(sample,subset) %>%
#calc normalization
mutate(CPMtype=n/sum(n)*1e4,nt1=sum(n)) %>% ungroup %>% select(-subset)
suppressPackageStartupMessages(library(scales))
compTab <- OTUlong %>% select(sample,standardName,CPMtotal) %>%
pivot_wider(names_from='standardName',values_from='CPMtotal') %>%
column_to_rownames('sample')
compTab[is.na(compTab)] <- 0
compPCA <- prcomp(compTab)
varianceExplained <- compPCA$sdev^2/sum(compPCA$sdev^2)
compPCAloadings <- compPCA$rotation %>% data.frame %>% rownames_to_column('type') %>% arrange(desc(PC1))
pctlabels <- label_percent()(varianceExplained)
compPCAtab <- compPCA$x %>% data.frame %>% rownames_to_column('pat_polyp') %>% separate(pat_polyp,into=c('pat','polyp'),sep='_')
options(repr.plot.width=8,repr.plot.height=8)
mergedType <- ggplot(compPCAtab,aes(x=PC1,y=PC2,color=polyp)) + geom_point(size=5) +
ggtitle('Polyp cell state sample composition PCA') + theme_classic() +
scale_color_manual(limits=c("YES", "NO"),
values=c("YES"="#db324c","NO"="#32c4db")) +
xlab(sprintf('PC1 (%s)',pctlabels[1])) +
ylab(sprintf('PC2 (%s)',pctlabels[2])) + theme(axis.title=element_text(size=20))
mergedType
sessionInfo()
R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] scales_1.2.1 colorRamps_2.3.1 RColorBrewer_1.1-3 ggforce_0.4.1 [5] tictoc_1.0.1 forcats_0.5.2 purrr_0.3.5 readr_2.1.3 [9] tidyr_1.2.1 tibble_3.1.8 tidyverse_1.3.2 ARBOL_4.0.0 [13] dendextend_1.16.0 cluster_2.1.3 strex_1.4.4 stringr_1.5.0 [17] glmGamPoi_1.10.1 ggnewscale_0.4.8 scatterpie_0.1.8 ggalluvial_0.12.3 [21] phyloseq_1.42.0 ape_5.6-2 data.tree_1.0.0 ggraph_2.1.0 [25] tidygraph_1.2.2 ggrepel_0.9.2 ggpubr_0.5.0 ggfortify_0.4.15 [29] reshape2_1.4.4 vegan_2.6-4 lattice_0.20-45 permute_0.9-7 [33] harmony_0.1.1 Rcpp_1.0.9 pvclust_2.2-0 dplyr_1.0.10 [37] ggplot2_3.4.0 limma_3.54.0 future_1.30.0 data.table_1.14.6 [41] SeuratObject_4.1.3 Seurat_4.3.0 Matrix.utils_0.9.7 Matrix_1.5-3 loaded via a namespace (and not attached): [1] pbdZMQ_0.3-7 scattermore_0.8 [3] ragg_1.2.2 irlba_2.3.5.1 [5] DelayedArray_0.24.0 RCurl_1.98-1.9 [7] generics_0.1.3 BiocGenerics_0.44.0 [9] callr_3.7.3 cowplot_1.1.1 [11] usethis_2.1.6 RANN_2.6.1 [13] tzdb_0.3.0 xml2_1.3.3 [15] spatstat.data_3.0-0 lubridate_1.9.0 [17] httpuv_1.6.7 SummarizedExperiment_1.28.0 [19] assertthat_0.2.1 viridis_0.6.2 [21] gargle_1.2.1 hms_1.1.2 [23] evaluate_0.19 promises_1.2.0.1 [25] fansi_1.0.3 readxl_1.4.1 [27] dbplyr_2.2.1 igraph_1.3.5 [29] DBI_1.1.3 htmlwidgets_1.6.0 [31] spatstat.geom_3.0-3 googledrive_2.0.0 [33] stats4_4.2.1 ellipsis_0.3.2 [35] backports_1.4.1 sparseMatrixStats_1.10.0 [37] deldir_1.0-6 MatrixGenerics_1.10.0 [39] vctrs_0.5.1 Biobase_2.58.0 [41] Cairo_1.6-0 remotes_2.4.2 [43] ROCR_1.0-11 abind_1.4-5 [45] cachem_1.0.6 withr_2.5.0 [47] grr_0.9.5 progressr_0.12.0 [49] checkmate_2.1.0 sctransform_0.3.5 [51] prettyunits_1.1.1 goftest_1.2-3 [53] IRdisplay_1.1 lazyeval_0.2.2 [55] crayon_1.5.2 spatstat.explore_3.0-5 [57] labeling_0.4.2 pkgconfig_2.0.3 [59] tweenr_2.0.2 GenomeInfoDb_1.34.4 [61] vipor_0.4.5 nlme_3.1-158 [63] pkgload_1.3.2 devtools_2.4.4 [65] rlang_1.0.6 globals_0.16.2 [67] lifecycle_1.0.3 miniUI_0.1.1.1 [69] modelr_0.1.10 ggrastr_1.0.1 [71] cellranger_1.1.0 rprojroot_2.0.3 [73] polyclip_1.10-4 matrixStats_0.63.0 [75] lmtest_0.9-40 IRkernel_1.3 [77] carData_3.0-5 Rhdf5lib_1.20.0 [79] zoo_1.8-11 beeswarm_0.4.0 [81] reprex_2.0.2 base64enc_0.1-3 [83] ggridges_0.5.4 processx_3.8.0 [85] googlesheets4_1.0.1 png_0.1-8 [87] viridisLite_0.4.1 bitops_1.0-7 [89] KernSmooth_2.23-20 rhdf5filters_1.10.0 [91] Biostrings_2.66.0 DelayedMatrixStats_1.20.0 [93] parallelly_1.33.0 spatstat.random_3.0-1 [95] rstatix_0.7.1 S4Vectors_0.36.1 [97] ggsignif_0.6.4 memoise_2.0.1 [99] magrittr_2.0.3 plyr_1.8.8 [101] ica_1.0-3 zlibbioc_1.44.0 [103] compiler_4.2.1 fitdistrplus_1.1-8 [105] cli_3.4.1 ade4_1.7-20 [107] XVector_0.38.0 urlchecker_1.0.1 [109] listenv_0.9.0 patchwork_1.1.2 [111] pbapply_1.6-0 ps_1.7.2 [113] MASS_7.3-58.1 mgcv_1.8-40 [115] tidyselect_1.2.0 stringi_1.7.8 [117] textshaping_0.3.6 grid_4.2.1 [119] tools_4.2.1 timechange_0.1.1 [121] future.apply_1.10.0 parallel_4.2.1 [123] uuid_1.1-0 foreach_1.5.2 [125] gridExtra_2.3 farver_2.1.1 [127] Rtsne_0.16 digest_0.6.31 [129] shiny_1.7.4 GenomicRanges_1.50.2 [131] car_3.1-1 broom_1.0.2 [133] later_1.3.0 RcppAnnoy_0.0.20 [135] httr_1.4.4 colorspace_2.0-3 [137] rvest_1.0.3 fs_1.5.2 [139] tensor_1.5 reticulate_1.26 [141] IRanges_2.32.0 splines_4.2.1 [143] uwot_0.1.14 spatstat.utils_3.0-1 [145] graphlayouts_0.8.4 sp_1.5-1 [147] multtest_2.54.0 systemfonts_1.0.4 [149] plotly_4.10.1 sessioninfo_1.2.2 [151] xtable_1.8-4 jsonlite_1.8.4 [153] ggfun_0.0.9 R6_2.5.1 [155] profvis_0.3.7 pillar_1.8.1 [157] htmltools_0.5.4 mime_0.12 [159] glue_1.6.2 fastmap_1.1.0 [161] codetools_0.2-18 pkgbuild_1.3.1 [163] utf8_1.2.2 spatstat.sparse_3.0-0 [165] ggbeeswarm_0.6.0 curl_4.3.3 [167] leiden_0.4.3 survival_3.3-1 [169] repr_1.1.4 biomformat_1.26.0 [171] munsell_0.5.0 rhdf5_2.42.0 [173] GenomeInfoDbData_1.2.9 iterators_1.0.14 [175] haven_2.5.1 gtable_0.3.1